Linear Regression
Core Concept
Linear regression models the target variable as a linear combination of the features plus an intercept (bias). For a single target (y) and feature vector (\mathbf{x}), the model is (y = \mathbf{w}^\top \mathbf{x} + b) (equivalently, (y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p)). The fit is a hyperplane in feature space whose weights minimize a loss over the training set, typically the sum of squared errors (SSE) or mean squared error (MSE), which yields a unique closed-form solution (the normal equation) when the design matrix is full rank. It is the foundational regression method: interpretable coefficients, fast training, and a single global fit that is well understood statistically and serves as a baseline for more flexible methods.
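As a concrete illustration of fitting such a model, here is a minimal sketch using scikit-learn on synthetic data; the feature count, coefficient values, and noise level are assumptions made for the example, not taken from the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y = 3*x1 - 2*x2 + 5 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + rng.normal(scale=0.1, size=200)

# Fitting minimizes mean squared error over the training set
model = LinearRegression().fit(X, y)
print(model.coef_)           # learned weights w, close to [3, -2]
print(model.intercept_)      # learned bias b, close to 5
print(model.predict(X[:3]))  # predictions y_hat = w^T x + b
```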
Key Characteristics
- Closed-form solution – Under squared-error loss, the optimal weights are given by the normal equation (\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}), assuming (\mathbf{X}^\top \mathbf{X}) is invertible. No iterative optimization is required for the basic formulation, though gradient descent is used for large-scale or regularized variants.
- Interpretability – Each coefficient (\beta_j) can be read as the expected change in the target per unit change in (x_j), holding the other features constant. The sign and magnitude of the coefficients support feature-importance and causal-style reasoning, though correlated features and confounding can distort both.
- Single global fit – One set of weights applies everywhere in the feature space; the model cannot capture different slopes or curvature in different regions unless features are engineered (e.g. interactions, polynomial terms) or the model is extended (e.g. piecewise linear).
- Assumptions – Classical inference (standard errors, confidence intervals) assumes linearity, independence of errors, homoscedasticity (constant error variance), and often normality of errors. Violations affect inference more than the fitted coefficients; robust or heteroscedasticity-consistent standard errors can relax some of these assumptions.
- Regularization – Ridge (L2) and Lasso (L1) add penalties on (\mathbf{w}), shrinking coefficients or performing feature selection; they improve generalization when (p) is large or features are correlated. Ridge retains a closed-form solution, ((\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}), while Lasso has no closed form and is solved by iterative methods such as coordinate descent (see the sketch after this list).
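The two closed forms mentioned above can be sketched directly in NumPy; the intercept handling (a prepended column of ones) and the example coefficients are illustrative assumptions, not a definitive implementation.

```python
import numpy as np

def ols_normal_equation(X, y):
    """Ordinary least squares via the normal equation.
    Assumes X already contains a column of ones for the intercept
    and that X^T X is invertible (full-rank design matrix)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge_closed_form(X, y, lam=1.0):
    """Ridge keeps a closed form: (X^T X + lam*I)^{-1} X^T y.
    For brevity this penalizes the intercept column too; libraries
    usually exclude it. Lasso (L1) has no closed form and needs an
    iterative solver such as coordinate descent."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Tiny example with made-up true coefficients [1, 2, -3]
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, -3.0]) + rng.normal(scale=0.1, size=100)
print(ols_normal_equation(X, y))         # roughly [1, 2, -3]
print(ridge_closed_form(X, y, lam=0.5))  # slightly shrunk toward zero
```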
Common Applications
- Demand and sales forecasting – Predicting quantity sold or revenue from price, promotion, seasonality, and other covariates
- Housing and asset valuation – Estimating price from size, location, number of rooms, and similar attributes
- Risk and exposure modeling – Predicting continuous risk scores or exposure levels from demographic and behavioral features
- Trend and time-index regression – Modeling a quantity as a linear function of time or an index when the relationship is approximately linear
- Causal and policy analysis – Estimating treatment effects or policy impacts when linearity and identification assumptions hold; coefficients support interpretable comparison across groups or conditions
- Baseline and residual analysis – Using linear regression as a simple baseline; examining residuals to guide feature engineering or the choice of more flexible models (a brief sketch follows this list)
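As a rough sketch of the baseline-plus-residuals workflow in the last item, assuming scikit-learn and placeholder synthetic data (a real application would supply its own X and y):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data with a mild nonlinearity the linear baseline will miss
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.2, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
baseline = LinearRegression().fit(X_train, y_train)

residuals = y_test - baseline.predict(X_test)
print("baseline R^2:", baseline.score(X_test, y_test))
print("residual std:", residuals.std())
# Systematic structure in the residuals (e.g. curvature against a feature)
# points to engineered terms or a more flexible model on top of this baseline.
```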